Commit Graph

573 Commits

Author SHA1 Message Date
Ilya Dryomov
f140662f35 crush: decode and initialize chooseleaf_vary_r
Commit e2b149cc4b ("crush: add chooseleaf_vary_r tunable") added the
crush_map::chooseleaf_vary_r field but missed the decode part.  This
lead to misdirected requests caused by incorrect raw crush mapping
sets.

Fixes: http://tracker.ceph.com/issues/8226

Reported-and-Tested-by: Dmitry Smirnov <onlyjob@member.fsf.org>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-05-16 21:29:55 +04:00
Chunwei Chen
178eda29ca libceph: fix corruption when using page_count 0 page in rbd
It has been reported that using ZFSonLinux on rbd will result in memory
corruption. The bug report can be found here:

https://github.com/zfsonlinux/spl/issues/241
http://tracker.ceph.com/issues/7790

The reason is that ZFS will send pages with page_count 0 into rbd, which in
turns send them to tcp_sendpage. However, tcp_sendpage cannot deal with
page_count 0, as it will do get_page and put_page, and erroneously free the
page.

This type of issue has been noted before, and handled in iscsi, drbd,
etc. So, rbd should also handle this. This fix address this issue by fall back
to slower sendmsg when page_count 0 detected.

Cc: Sage Weil <sage@inktank.com>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: stable@vger.kernel.org
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-05-16 21:29:26 +04:00
Linus Torvalds
5575eeb7b9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "First, there is a critical fix for the new primary-affinity function
  that went into -rc1.

  The second batch of patches from Zheng fix a range of problems with
  directory fragmentation, readdir, and a few odds and ends for cephfs"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: reserve caps for file layout/lock MDS requests
  ceph: avoid releasing caps that are being used
  ceph: clear directory's completeness when creating file
  libceph: fix non-default values check in apply_primary_affinity()
  ceph: use fpos_cmp() to compare dentry positions
  ceph: check directory's completeness before emitting directory entry
2014-05-05 15:17:02 -07:00
Ilya Dryomov
92b2e75158 libceph: fix non-default values check in apply_primary_affinity()
osd_primary_affinity array is indexed into incorrectly when checking
for non-default primary-affinity values.  This nullifies the impact of
the rest of the apply_primary_affinity() and results in misdirected
requests.

                if (osds[i] != CRUSH_ITEM_NONE &&
                    osdmap->osd_primary_affinity[i] !=
                                                ^^^
                                        CEPH_OSD_DEFAULT_PRIMARY_AFFINITY) {

For a pool with size 2, this always ends up checking osd0 and osd1
primary_affinity values, instead of the values that correspond to the
osds in question.  E.g., given a [2,3] up set and a [max,max,0,max]
primary affinity vector, requests are still sent to osd2, because both
osd0 and osd1 happen to have max primary_affinity values and therefore
we return from apply_primary_affinity() early on the premise that all
osds in the given set have max (default) values.  Fix it.

Fixes: http://tracker.ceph.com/issues/7954

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-28 12:54:10 -07:00
David S. Miller
676d23690f net: Fix use after free by removing length arg from sk_data_ready callbacks.
Several spots in the kernel perform a sequence like:

	skb_queue_tail(&sk->s_receive_queue, skb);
	sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up.  So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument.  And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-11 16:15:36 -04:00
Linus Torvalds
240cd6a817 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "The biggest chunk is a series of patches from Ilya that add support
  for new Ceph osd and crush map features, including some new tunables,
  primary affinity, and the new encoding that is needed for erasure
  coding support.  This brings things into parity with the server side
  and the looming firefly release.  There is also support for allocation
  hints in RBD that help limit fragmentation on the server side.

  There is also a series of patches from Zheng fixing NFS reexport,
  directory fragmentation support, flock vs fnctl behavior, and some
  issues with clustered MDS.

  Finally, there are some miscellaneous fixes from Yunchuan Wen for
  fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
  behavior"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
  ceph: skip invalid dentry during dcache readdir
  libceph: dump pool {read,write}_tier to debugfs
  libceph: output primary affinity values on osdmap updates
  ceph: flush cap release queue when trimming session caps
  ceph: don't grabs open file reference for aborted request
  ceph: drop extra open file reference in ceph_atomic_open()
  ceph: preallocate buffer for readdir reply
  libceph: enable PRIMARY_AFFINITY feature bit
  libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
  libceph: add support for osd primary affinity
  libceph: add support for primary_temp mappings
  libceph: return primary from ceph_calc_pg_acting()
  libceph: switch ceph_calc_pg_acting() to new helpers
  libceph: introduce apply_temps() helper
  libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
  libceph: ceph_can_shift_osds(pool) and pool type defines
  libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
  libceph: enable OSDMAP_ENC feature bit
  libceph: primary_affinity decode bits
  libceph: primary_affinity infrastructure
  ...
2014-04-07 11:09:13 -07:00
Ilya Dryomov
8a53f23fcd libceph: dump pool {read,write}_tier to debugfs
Dump pool {read,write}_tier to debugfs.  While at it, fixup printk type
specifiers and remove the unnecessary cast to unsigned long long.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-04-04 21:08:29 -07:00
Ilya Dryomov
f31da0f3e1 libceph: output primary affinity values on osdmap updates
Similar to osd weights, output primary affinity values on incremental
osdmap updates.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-04-04 21:08:28 -07:00
Ilya Dryomov
c4c1228525 libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
and get rid of the now unused calc_pg_raw().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:19 -07:00
Ilya Dryomov
47ec1f3cc4 libceph: add support for osd primary affinity
Respond to non-default primary_affinity values accordingly.  (Primary
affinity allows the admin to shift 'primary responsibility' away from
specific osds, effectively shifting around the read side of the
workload and whatever overhead is incurred by peering and writes by
virtue of being the primary).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:17 -07:00
Ilya Dryomov
5e8d4d36bf libceph: add support for primary_temp mappings
Change apply_temp() to override primary in the same way pg_temp
overrides osd set.  primary_temp overrides pg_temp primary too.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:16 -07:00
Ilya Dryomov
8008ab1080 libceph: return primary from ceph_calc_pg_acting()
In preparation for adding support for primary_temp, stop assuming
primaryness: add a primary out parameter to ceph_calc_pg_acting() and
change call sites accordingly.  Primary is now specified separately
from the order of osds in the set.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:14 -07:00
Ilya Dryomov
ac972230e2 libceph: switch ceph_calc_pg_acting() to new helpers
Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(),
raw_to_up_osds() and apply_temps().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:13 -07:00
Ilya Dryomov
45966c3467 libceph: introduce apply_temps() helper
apply_temp() helper for applying various temporary mappings (at this
point only pg_temp mappings) to the up set, therefore transforming it
into an acting set.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:11 -07:00
Ilya Dryomov
2bd93d4d7e libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
pg_to_raw_osds() helper for computing a raw (crush) set, which can
contain non-existant and down osds.

raw_to_up_osds() helper for pruning non-existant and down osds from the
raw set, therefore transforming it into an up set, and determining up
primary.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:10 -07:00
Ilya Dryomov
63a6993f52 libceph: primary_affinity decode bits
Add two helpers to decode primary_affinity (full map, vector<u32>) and
new_primary_affinity (inc map, map<u32, u32>) and switch to them.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:04 -07:00
Ilya Dryomov
2cfa34f2d6 libceph: primary_affinity infrastructure
Add primary_affinity infrastructure.  primary_affinity values are
stored in an max_osd-sized array, hanging off ceph_osdmap, similar to
a osd_weight array.

Introduce {get,set}_primary_affinity() helpers, primarily to return
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to
abstract out osd_primary_affinity array allocation and initialization.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:02 -07:00
Ilya Dryomov
d286de796a libceph: primary_temp decode bits
Add a common helper to decode both primary_temp (full map, map<pg_t,
u32>) and new_primary_temp (inc map, same) and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:00 -07:00
Ilya Dryomov
9686f94c8c libceph: primary_temp infrastructure
Add primary_temp mappings infrastructure.  struct ceph_pg_mapping is
overloaded, primary_temp mappings are stored in an rb-tree, rooted at
ceph_osdmap, in a manner similar to pg_temp mappings.

Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'primary_temp <pgid> <osd>' per line, e.g:

    primary_temp 2.6 4

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:58 -07:00
Ilya Dryomov
35a935d75d libceph: generalize ceph_pg_mapping
In preparation for adding support for primary_temp mappings, generalize
struct ceph_pg_mapping so it can hold mappings other than pg_temp.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:57 -07:00
Ilya Dryomov
ec7af97258 libceph: introduce get_osdmap_client_data_v()
Full and incremental osdmaps are structured identically and have
identical headers.  Add a helper to decode both "old" (16-bit version,
v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap
enconding headers and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:55 -07:00
Ilya Dryomov
10db634e20 libceph: introduce decode{,_new}_pg_temp() and switch to them
Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp
(inc map, same) decoding logic into a common helper and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:53 -07:00
Ilya Dryomov
4d60351f90 libceph: switch osdmap_set_max_osd() to krealloc()
Use krealloc() instead of rolling our own.  (krealloc() with a NULL
first argument acts as a kmalloc()).  Properly initalize the new array
elements.  This is needed to make future additions to osdmap easier.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:52 -07:00
Ilya Dryomov
433fbdd31d libceph: introduce decode{,_new}_pools() and switch to them
Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc
map, same) decoding logic into a common helper and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:50 -07:00
Ilya Dryomov
0f70c7eedb libceph: rename __decode_pool{,_names}() to decode_pool{,_names}()
To be in line with all the other osdmap decode helpers.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:49 -07:00